Training Set Selection for Building Compact and Efficient Language Models
Authors
Abstract
Similar Resources
Building Compact N-gram Language Models Incrementally
In traditional n-gram language modeling, we collect the statistics for all n-grams observed in the training set up to a certain order. The model can then be pruned down to a more compact size with some loss in modeling accuracy. One of the more principled methods for pruning the model is the entropy-based pruning proposed by Stolcke (1998). In this paper, we present an algorithm for incremental...
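The entropy-based pruning criterion referenced here can be illustrated with a toy sketch. The data structures, symbols, and threshold below are illustrative assumptions, not Stolcke's exact formulation; a full implementation must also renormalize the backoff weights after removing an n-gram.

```python
import math

def pruning_cost(p_bigram, p_backoff, p_history):
    # Approximate relative-entropy increase from dropping the explicit
    # bigram estimate P(w|h) and falling back to the backed-off P(w),
    # weighted by how often the history h occurs.
    return p_history * p_bigram * math.log(p_bigram / p_backoff)

def prune(bigrams, threshold):
    # Keep only bigrams whose removal would raise model entropy
    # by more than `threshold`; the rest are pruned away.
    return {hw: probs for hw, probs in bigrams.items()
            if pruning_cost(*probs) > threshold}

# Toy model: (history, word) -> (P(w|h), backoff P(w), P(h))
bigrams = {
    ("the", "cat"): (0.20, 0.01, 0.05),   # far above its backoff estimate
    ("the", "of"):  (0.011, 0.01, 0.05),  # barely above: cheap to prune
}
kept = prune(bigrams, threshold=1e-4)
```

Raising the threshold prunes more aggressively, trading modeling accuracy for a more compact model.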
Efficient Subsampling for Training Complex Language Models
We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. ...
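The binary reformulation described in this abstract can be sketched roughly as follows. The corpus format, the `neg_rate` parameter, and the uniform subsampling scheme are illustrative assumptions, not the paper's exact method:

```python
import random

def to_binary_examples(corpus, target_word, neg_rate, seed=0):
    # Turn multi-class next-word prediction into one binary task:
    # "does `target_word` occur after this context?" All positives are
    # kept; negatives are subsampled at rate `neg_rate` (a simplified
    # stand-in for the paper's more robust subsampling technique).
    rng = random.Random(seed)
    examples = []
    for context, word in corpus:
        if word == target_word:
            examples.append((context, 1))
        elif rng.random() < neg_rate:
            examples.append((context, 0))
    return examples

corpus = [("the quick", "fox"), ("a lazy", "dog"),
          ("the sly", "fox"), ("my old", "cat")]
ex = to_binary_examples(corpus, "fox", neg_rate=0.5)
```

One such binary classifier per vocabulary word replaces the single expensive multi-class normalization over the whole vocabulary.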
Data Selection for Compact Adapted SMT Models
Data selection is a common technique for adapting statistical translation models to a specific domain, and it has been shown both to improve translation quality and to reduce model size. Selection relies on some in-domain data from the same domain as the texts expected to be translated. Selecting the sentence-pairs that are most similar to the in-domain data from a pool of parallel texts has bee...
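One common instantiation of this similarity-based selection is cross-entropy difference scoring with small language models; the sketch below uses add-one smoothed unigram models and toy data, which are our assumptions rather than this paper's method:

```python
import math
from collections import Counter

def avg_logprob(tokens, counts, total, vocab):
    # Add-one smoothed unigram log-probability, averaged per token.
    return sum(math.log((counts[t] + 1) / (total + vocab))
               for t in tokens) / len(tokens)

def select(pool, in_domain, general, k):
    # Rank pool sentences by how much more likely they are under the
    # in-domain LM than under the general LM, then keep the top k.
    vocab = len({w for s in in_domain + general for w in s})
    c_in = Counter(w for s in in_domain for w in s)
    c_gen = Counter(w for s in general for w in s)
    n_in, n_gen = sum(c_in.values()), sum(c_gen.values())
    ranked = sorted(pool,
                    key=lambda s: avg_logprob(s, c_in, n_in, vocab)
                                - avg_logprob(s, c_gen, n_gen, vocab),
                    reverse=True)
    return ranked[:k]

# Toy example: pick the pool sentence closest to the biomedical domain.
in_domain = [["gene", "protein"], ["protein", "cell"]]
general = [["stock", "market"], ["market", "price"]]
pool = [["gene", "cell"], ["stock", "price"]]
```

Training the adapted model only on the top-ranked pairs yields a smaller model than training on the full pool.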
Training Set and Filters Selection for the Efficient Use of Multispectral Acquisition Systems
The quality of the results obtained from a multispectral acquisition system can be affected by several factors, including the training set on which the system characterization model relies and the optical filters that allow acquisition in different bands of the light spectrum. In this paper, we investigate the joint effect of training set and filter selection on the results of a typical multisp...
Compact Maximum Entropy Language Models
In language modeling we are always confronted with a sparse-data problem. The Maximum Entropy formalism makes it possible to fully integrate complementary statistical properties of limited corpora. The focus of the present paper is twofold. The new smoothing technique of LM-induced marginals is introduced and discussed. We then highlight the advantages resulting from a combination of robust features and s...
Journal
Journal title: IEICE Transactions on Information and Systems
Year: 2009
ISSN: 0916-8532,1745-1361
DOI: 10.1587/transinf.e92.d.506